The main problem we will be working on this semester is detecting grammatical errors and mistakes in written English text, identifying (tagging) them, and then rewriting the sentence without grammatical mistakes. The goal is to train a model that can detect mistakes in spelling, punctuation, and grammar.
The datasets used are English text corpora, either created synthetically or used by shared tasks such as BEA19 and CoNLL2014. They may be formatted as plain English text or in the M2 format standardised by the ERRANT framework (more information about the M2 format and ERRANT can be found in the references below). The following are the datasets used for training:
The input to the model is an English sentence that may contain grammatical errors. The output of the model is the same sentence rewritten with any grammatical errors corrected.
Currently, the state-of-the-art model for grammatical error correction is the T5 (Text-to-Text Transfer Transformer) model by Google. It is a transformer encoder-decoder model and comes in different sizes, with the smallest base model at 600 million parameters and the largest at 13 billion parameters. Initially, this model produced state-of-the-art results for all languages except English: the base model was inferior to the then-current SOTA model, but the large model with 11 billion parameters achieved SOTA results in Czech, German, and Russian. Later, with a new T5 XXL model with 11B parameters, they were able to achieve SOTA results on all languages the model was trained on.
We adopted the GECToR model for the GEC problem. Its main approach is simplifying the task from sequence generation to sequence tagging. The GECToR sequence-tagging architecture is an encoder made up of a pre-trained BERT-like transformer stacked with two linear layers topped with softmax layers. The two linear layers are responsible for mistake detection and error tagging, respectively.
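As an illustration, here is a minimal PyTorch/Hugging Face sketch of this kind of architecture. The class name GECTagger, the encoder checkpoint, and the tag vocabulary size are placeholders for explanation, not the exact GECToR implementation.

```python
import torch.nn as nn
from transformers import AutoModel

class GECTagger(nn.Module):
    """Sketch of a GECToR-style tagger: a pre-trained encoder with two heads."""

    def __init__(self, encoder_name="roberta-base", num_tags=5000):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.detect = nn.Linear(hidden, 2)        # head 1: is this token erroneous?
        self.tag = nn.Linear(hidden, num_tags)    # head 2: which edit tag to apply?

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        # Softmax on top of each linear layer, as in the GECToR architecture.
        return self.detect(h).softmax(-1), self.tag(h).softmax(-1)
```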
GECToR defines custom token-level transformations that recover the target text when applied to the source tokens. These transformations increase the coverage of grammatical errors; the edit space consists mainly of basic transformations plus some g-transformations.
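To make the edit space concrete, here is a hedged sketch of how basic transformation tags can be applied to source tokens. The tag names follow the GECToR convention ($KEEP, $DELETE, $APPEND_x, $REPLACE_x); the helper function itself is only illustrative.

```python
def apply_basic_tag(tokens, i, tag):
    """Apply one basic token-level transformation at position i (illustrative only)."""
    if tag == "$KEEP":
        return tokens
    if tag == "$DELETE":
        return tokens[:i] + tokens[i + 1:]
    if tag.startswith("$APPEND_"):                       # insert a token after position i
        return tokens[:i + 1] + [tag[len("$APPEND_"):]] + tokens[i + 1:]
    if tag.startswith("$REPLACE_"):                      # replace the token at position i
        return tokens[:i] + [tag[len("$REPLACE_"):]] + tokens[i + 1:]
    # g-transformations (e.g. changing verb form or noun number) require
    # morphological rules and are not shown here.
    return tokens

# Example: "She go to school ." with $REPLACE_goes at index 1 -> "She goes to school ."
print(apply_basic_tag("She go to school .".split(), 1, "$REPLACE_goes"))
```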
We propose some updates to the original model using new approaches, which could address some of its shortcomings and produce better results.
More data has been shown to be beneficial for GEC, so we augmented the development data used in the fine-tuning stage with the gold annotations of multiple shared tasks.
We applied heuristic, dictionary-based Levenshtein-distance spell checking to the data before training the model. Our reasoning was that this would shrink the edit space, which could help the model detect more grammatical errors. However, after testing, the scores were close; the spell-checked data gave slightly better accuracy, but this was due to greater imbalance in the data.
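A minimal sketch of the kind of dictionary-based Levenshtein spell checking we mean follows; the word list and distance threshold here are placeholders, and our actual preprocessing used a full lexicon.

```python
# Placeholder word list; a real run would load a full English lexicon.
DICTIONARY = {"grammar", "sentence", "correct", "school", "teacher"}

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def spell_check(word, max_dist=2):
    """Replace an out-of-dictionary word by its nearest dictionary entry."""
    if word.lower() in DICTIONARY:
        return word
    dist, best = min((levenshtein(word.lower(), w), w) for w in DICTIONARY)
    return best if dist <= max_dist else word
```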
With the development of more recent transformers than the ones mentioned in the paper, we wanted to test the model with different types of transformers. The transformers tested are the ELECTRA-generator model and the ELECTRA-discriminator model.
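Because the tagger only depends on the encoder's hidden states, swapping transformers is mostly a matter of changing the checkpoint name. A short sketch using the public Hugging Face ELECTRA checkpoints (the GECTagger class refers to the sketch above, not to our exact code):

```python
from transformers import AutoModel

# Public Hugging Face checkpoint IDs for the base-sized ELECTRA models.
generator_encoder = AutoModel.from_pretrained("google/electra-base-generator")
discriminator_encoder = AutoModel.from_pretrained("google/electra-base-discriminator")

# Either encoder can be dropped into the GECTagger sketch above, e.g.:
# tagger = GECTagger(encoder_name="google/electra-base-discriminator")
```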
We also wondered why not use an encoder-decoder transformer with the decoder removed. The paper hypothesizes that encoders from encoder-decoder transformers (NLG) are less useful for GEC models than transformer-encoder models (NLU), but no experiments were reported. Therefore, we decided to test this hypothesis using the encoder from the T5 transformer model, since T5 produces SOTA results in most languages. The T5 architecture is a standard transformer encoder-decoder, from which we take only the encoder.
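One way to obtain only the encoder half of T5 is Hugging Face's T5EncoderModel; the sketch below illustrates the idea with the public t5-base checkpoint and is not necessarily our exact implementation.

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")   # only the encoder stack is instantiated

inputs = tokenizer("She go to school every day .", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state          # (batch, seq_len, d_model)
# `hidden` can then be fed to the two tagging heads, just as with a BERT-like encoder.
```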
We decided to try different optimizers, especially for the T5 model, where the Adafactor optimizer is recommended. We tested the T5 model and the ELECTRA-discriminator model using the default Adam optimizer, the Adafactor optimizer, and the SGD optimizer. We felt these optimizers were the most likely to produce the best results, and given the long training times, we decided to stick with these three.
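A sketch of how the three optimizers could be instantiated; the learning rates and momentum values here are illustrative defaults, not our tuned settings.

```python
import torch
from transformers.optimization import Adafactor

def build_optimizer(model, name, lr=1e-5):
    """Return one of the three optimizers compared in our experiments (sketch)."""
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=lr)
    if name == "sgd":
        return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    if name == "adafactor":
        # Relative-step Adafactor, roughly following the T5 recommendation.
        return Adafactor(model.parameters(), lr=None,
                         relative_step=True, warmup_init=True)
    raise ValueError(f"unknown optimizer: {name}")
```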
The paper mentions two inference tweaks that greatly influence performance:
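For reference, the two tweaks described in the GECToR paper are a confidence bias added to the $KEEP tag and a sentence-level minimum error probability threshold. A minimal sketch with illustrative (not tuned) values:

```python
import torch

def apply_inference_tweaks(tag_probs, error_probs, keep_index,
                           keep_bias=0.2, min_error_prob=0.5):
    """tag_probs: (seq_len, num_tags) tag distribution; error_probs: (seq_len,) error-head output."""
    if error_probs.max() < min_error_prob:
        return None                          # tweak 2: leave the sentence unchanged
    tag_probs = tag_probs.clone()
    tag_probs[:, keep_index] += keep_bias    # tweak 1: bias toward $KEEP, i.e. fewer edits
    return tag_probs
```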
In the paper, better results were achieved when using an ensemble of models, so we did the same using the ELECTRA-discriminator and T5 models together with the pretrained XLNet and RoBERTa models. The main motive behind using an ensemble is to balance the precision and recall scores. We tried a number of combinations:
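Whatever the particular combination, the ensembling mechanism used in the GECToR paper is to average the output tag distributions of the individual models; a minimal sketch, assuming taggers shaped like the GECTagger sketch above:

```python
import torch

def ensemble_tag_probs(models, input_ids, attention_mask):
    """Average the per-token tag distributions of several taggers (sketch)."""
    with torch.no_grad():
        probs = [m(input_ids, attention_mask)[1] for m in models]  # tag-head outputs
    return torch.stack(probs).mean(dim=0)
```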
After applying the updates proposed, we evaluated the models on the CoNLL2014 shared task and the BEA19 shared task.
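For the BEA19-style evaluation, system outputs are converted to M2 edits and scored with ERRANT. A small sketch using the ERRANT Python API (exact calls may differ slightly between versions):

```python
import errant

annotator = errant.load("en")  # requires the spaCy English model to be installed

orig = annotator.parse("She go to school every day .")
cor = annotator.parse("She goes to school every day .")

for e in annotator.annotate(orig, cor):
    # Each edit records the source span, the correction, and the ERRANT error
    # type -- the same information stored per line in the M2 format.
    print(e.o_start, e.o_end, e.o_str, "->", e.c_str, e.type)
```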
Some technical details related to training:
Now that we know that discriminator networks are very well suited for this task, we want to train and fine-tune the largest pretrained ELECTRA discriminator model (ELECTRA-1.75M, with 335 million parameters), as it has been shown on other tasks to improve scores by around 5% on average. We also want to experiment with discriminator networks other than ELECTRA. Moreover, as we learned from our experiments, ELECTRA is much faster to train than the other models, so we can afford to increase the tag vocabulary size from 5000 to 10000 to increase the coverage of errors, as explained by Grammarly.
All references used, including papers, datasets, and other sources from which we took information.